- Hello everyone, welcome to CS231n. I'm Song Han. Today I'm going to give a guest lecture on efficient methods and hardware for deep learning. I'm a fifth-year PhD candidate here at Stanford, advised by Professor Bill Dally.

So in this course we have seen a lot of convolutional neural networks, recurrent neural networks, and even, since last time, reinforcement learning. They span a lot of applications: for example, self-driving cars, machine translation, AlphaGo, and smart robots. They are changing our lives, but there is a recent trend: in order to achieve such high accuracy, the models are getting larger and larger. For example, for ImageNet recognition, from the 2012 winner to the 2015 winner, the model size increased by 16x. And for Baidu's Deep Speech, in just one year the number of training operations increased by 10x.

Such large models create lots of problems. First, large models are difficult to deploy, for example on mobile phones: if an app is larger than 100 megabytes, you cannot download it until you connect to Wi-Fi. So product managers at companies like Baidu and Facebook are very sensitive to the binary size of their models. Likewise for self-driving cars, where you can only ship model updates over the air; if the model is too large, that also becomes difficult.

The second challenge for these large models is that training speed is extremely slow. For example, ResNet-152, which is actually less than 1% more accurate than ResNet-101, takes 1.5 weeks to train on four Maxwell M40 GPUs. That greatly slows down both our homework and researchers designing new models.

And the third challenge for these bulky models is energy efficiency. For example, AlphaGo beating Lee Sedol last year took 2,000 CPUs and 300 GPUs, costing $3,000 just to pay the electric bill, which is insane.
So on embedded devices these models drain your battery, and in the data center they increase the total cost of ownership of maintaining a large data center. For example, Google mentioned in their blog that if every user used Google Voice Search for just three minutes, they would have to double their data centers. That's a large cost, so reducing it is very important.

Let's see where the energy is actually consumed. A large model means lots of memory access: you have to load the model from memory, which means more energy. If you compare how much energy is consumed by loading from memory versus how much is consumed by the arithmetic operations, the multiplies and adds, memory access is two to three orders of magnitude more energy-consuming than arithmetic.

So how do we make deep learning more efficient? We have to improve energy efficiency through algorithm and hardware co-design. The previous way of designing hardware was: take some benchmarks, say SPEC 2006, run them, and tune your CPU architecture for those benchmarks. Now what we should do is open up the box, see what we can do from the algorithm side first, and ask what the optimal "?PU", some yet-to-be-named processing unit, should look like. That breaks the boundary between algorithm and hardware to improve the overall efficiency.

So today's talk has the following agenda. We are going to cover four aspects, algorithm and hardware crossed with inference and training, which form a small two-by-two matrix: algorithms for efficient inference, hardware for efficient inference, algorithms for efficient training, and lastly hardware for efficient training. For example, I'm going to cover the TPU, and I'm going to cover Volta.

But before that, let's have three slides of Hardware 101: a brief introduction to the families of hardware, arranged as a tree. In general, there are roughly two branches.
One is general-purpose hardware, which can run any application, versus specialized hardware, which is tuned for a specific kind of application, a domain of applications. General-purpose hardware includes the CPU and the GPU. The difference is that the CPU is latency-oriented and single-threaded, like a big elephant, while the GPU is throughput-oriented: its threads are small and weak, but there are thousands of these small cores, like a colony of ants. For specialized hardware, roughly there are FPGAs and ASICs. FPGA stands for Field Programmable Gate Array: it is hardware-programmable, so its logic can be changed. That makes it cheaper to try new ideas and prototype, but less efficient; it sits in the middle between general-purpose hardware and a pure ASIC. ASIC stands for Application-Specific Integrated Circuit: it has fixed logic, designed for one application, for example deep learning. Google's TPU is a kind of ASIC, and the GPUs we trained neural networks on earlier also sit on this branch.

Another slide of Hardware 101 is on number representations. The idea I want to convey is that numbers in a computer are not real numbers; they are actually discrete. Even 32-bit floating-point numbers do not have perfect resolution: they are not continuous but discrete. For example, FP32 means using 32 bits to represent a floating-point number. There are three components in the representation, the sign bit S, the exponent bits E, and the mantissa bits M, and the number it represents is (-1)^S × 1.M × 2^(E-127), where 127 is the standard exponent bias. Similarly there is FP16, which uses 16 bits to represent a floating-point number.
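To make that discreteness concrete, here is a small Python sketch (standard library only, not from the lecture) that unpacks the bit fields of an FP32 number and reassembles the value as (-1)^S × 1.M × 2^(E-127); the bias of 127 comes from the IEEE 754 standard.

```python
import struct

def decode_fp32(x: float) -> tuple[int, int, float]:
    """Unpack an IEEE 754 single-precision float into its three fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # the raw 32-bit pattern
    sign = bits >> 31                  # 1 sign bit
    exponent = (bits >> 23) & 0xFF     # 8 exponent bits, biased by 127
    mantissa = bits & 0x7FFFFF         # 23 mantissa bits
    # Reassemble (-1)^S * 1.M * 2^(E-127); valid for normalized numbers.
    value = (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)
    return sign, exponent - 127, value

print(decode_fp32(3.14159))  # the reassembled value equals the stored float
print(decode_fp32(0.1))      # prints 0.10000000149...: 0.1 is not exactly representable
```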
In particular, I'm going to introduce Int8, which the TPU uses at its core: an integer representing a fixed-point number. We have a certain number of bits for the integer part, followed by a radix point, which we can place differently for different layers, and lastly the fractional bits.

So why do we prefer 8 bits or 16 bits over the traditional 32-bit floating point? The cost. I generated this figure from 45-nanometer technology data, showing the energy cost and the area cost of different operations. In particular, going from 32 bits to 16 bits gives about a 4x reduction in energy and also about a 4x reduction in area. Area means money: every square millimeter costs money when you tape out a chip. So going from 32 bits to 16 bits is very beneficial for hardware design. That's why NVIDIA, starting with the Pascal architecture, announced support for FP16. As an illustration of what a 4x reduction in energy cost means: a battery that previously lasted four hours now lasts sixteen.

But there is still the problem of the large energy cost of reading memory. Memory references are that expensive, so how do we deal with this problem better? Let's switch gears and come to the topic directly: algorithms for efficient inference. I'm going to cover six topics, and this is a really long part, so I'm going to go relatively fast.

The first idea is pruning: pruning neural networks. This is the original network, and the question is: can we remove some of the weights and still keep the same accuracy? It's like pruning a tree, getting rid of the redundant connections. This was first proposed by Professor Yann LeCun back in 1989, and I revisited the problem 26 years later on modern deep neural nets to see how it works. Not all parameters are useful, actually.
For example, in this case, if you want to fit a straight line but you include a quadratic term, then the 0.01 coefficient is clearly a redundant parameter. So I train the connectivity first, then prune away some of the connections, then retrain the remaining weights, and iterate. As a result, I can reduce the number of connections in AlexNet from 60 million parameters to only 6 million, which is 10 times less computation.

So this is the accuracy plot: the x-axis is how many parameters we prune away, and the y-axis is the accuracy. We want fewer parameters but the same accuracy as before; we don't want to sacrifice accuracy. For example, at 80%, if we just zero away 80% of the parameters, the accuracy drops by about 4%. That's intolerable. But the good thing is that if we retrain the remaining weights, the accuracy fully recovers. And if we do this process iteratively, pruning and retraining, pruning and retraining, the accuracy doesn't begin to drop until we have pruned away 90% of the parameters. So if you go home and try it, say in an iPython notebook on your homework model, and just zero away 50% of the parameters, you will be astonished to find that the accuracy doesn't suffer at all.

We just covered convolutional neural nets; how about RNNs and LSTMs? I tried this with NeuralTalk. Again, pruning away 90% of the weights doesn't hurt the BLEU score. Here are some visualizations. On the original picture, NeuralTalk says "a basketball player in a white uniform is playing with a ball." After pruning away 90% of the weights, it says "a basketball player in a white uniform is playing with a basketball," and so on. But if you're too aggressive, say pruning away 95% of the weights, the network gets drunk: it says "a man in a red shirt and white and black shirt is running through a field."
So there is really a limit, a threshold, that you have to take care of during pruning.

Interestingly, after I did this work, I did some research and found that the same pruning procedure actually happens in the human brain as well. When we are born, there are about 50 trillion synapses in the brain. By one year old, this number has surged to 1,000 trillion. And as we become adolescents it actually becomes smaller, about 500 trillion in the end, according to a study published in Nature. So this is very interesting.

Pruning also changes the weight distribution, because we remove the small connections around zero; after we retrain, the remaining weights spread out, which is why the distribution looks softer in the end.

Yeah, question.
- [Student] Do you mean that the weights pruned during training are just set to zero, and training then continues with those weights starting from zero?
- Yeah, so the question is: how do we deal with those zeroed connections? We force them to stay at zero in all the later iterations.
Question?
- [Student] How do you pick which weights to drop?
- Yeah, so it's very simple: sort the weights, and if a weight is small, drop it.
- [Student] With a threshold that you decide?
- Exactly, yeah.
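Here is a minimal NumPy sketch of the iterative magnitude pruning just described; the gradient function, learning rate, and step counts are stand-ins for a real training setup, not from the lecture.

```python
import numpy as np

def prune_by_magnitude(w, fraction):
    """Return a 0/1 mask that zeroes out the `fraction` smallest-magnitude weights."""
    threshold = np.quantile(np.abs(w), fraction)  # sort magnitudes, pick the cutoff
    return (np.abs(w) > threshold).astype(w.dtype)

def fake_gradient(w):
    """Hypothetical stand-in for a real backprop gradient."""
    return 0.01 * w

w = np.random.randn(256, 256)
for fraction in (0.5, 0.7, 0.9):       # prune, retrain, prune more, retrain ...
    mask = prune_by_magnitude(w, fraction)
    w *= mask                           # drop the small weights
    for _ in range(100):                # "retrain" the surviving weights
        w -= 0.1 * fake_gradient(w)
        w *= mask                       # pruned weights stay pinned at zero (per the Q&A)
    print(f"kept {mask.mean():.0%} of the weights")
```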
So, the next idea: weight sharing. Remember, our end goal is to reduce the memory footprint so that we get more energy-efficient deployment. Pruning gave us fewer parameters; now we want fewer bits per parameter, so that, multiplied together, they give a small model.

The idea is this: not every weight has to be an exact number. For example, 2.09 and 2.12, all four of these weights, you can just represent with 2.0. That's enough; an overly precise number just leads to overfitting. So I cluster the weights, and if they are similar, I use a centroid to represent them instead of the full-precision values, so that at inference time I work with this single number. For example, this is a four-by-four weight matrix in a certain layer. I run k-means clustering so that similar weights share the same centroid: for 2.09 and 2.12, I store an index of 3 pointing to that centroid. The good thing is that we then only need to store the 2-bit index rather than the 32-bit floating-point number. That's a 16x saving.

And how do we train such a neural network, where the weights are bound together? After we get the gradients, we color them in the same pattern as the weights, do a group-by operation so that all gradients whose weights share the same index are grouped together, do a reduction by summing them up, multiply by the learning rate, and subtract the result from the original centroid. That's one iteration of SGD for such a weight-shared neural network.
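Here is a small NumPy sketch of that procedure: a tiny 1-D k-means over the weights, 2-bit indices into four centroids, and one group-by/reduce SGD step on the shared centroids. The gradient is a stand-in, and a real implementation would do this per layer.

```python
import numpy as np

def kmeans_1d(w, k, iters=20):
    """Tiny 1-D k-means over weight values."""
    centroids = np.linspace(w.min(), w.max(), k)
    for _ in range(iters):
        idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):
                centroids[j] = w[idx == j].mean()
    return centroids, idx

w = np.random.randn(16)                    # a 4x4 weight matrix, flattened
centroids, idx = kmeans_1d(w, k=4)         # 4 centroids -> a 2-bit index per weight

# One SGD step on the shared weights: group gradients by index, reduce, update.
grad = 0.01 * w                            # hypothetical backprop gradient
lr = 0.1
for j in range(len(centroids)):
    centroids[j] -= lr * grad[idx == j].sum()   # sum (reduce) within each group

w_shared = centroids[idx]                  # inference needs only index + codebook
```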
So remember what the weight distribution looked like after pruning; after weight sharing, the weights become discrete. There are only 16 different values here, meaning we can use four bits to represent each number. And by training on such a weight-shared neural network, on this extremely shared neural network, the weights can still adjust; those subtle changes compensate for the loss of accuracy.

So let's see: this axis is the number of bits we allow, and this is the accuracy. For the convolutional layers, not until four bits does the accuracy begin to drop; and for the fully connected layers, very astonishingly, not until two bits, only four distinct values, does the accuracy begin to drop. This result is per layer.

So we have covered two methods, pruning and weight sharing. What if we combine the two; do they still work well? Combining them, this axis is the compression ratio, smaller on the left, and this is the accuracy. Together they can bring the model to about 3% of its original size without hurting the accuracy at all, whereas with each method working individually, the accuracy begins to drop at around 10% of the original size. And compared with the SVD method, which is cheap, this gives a much better compression ratio.

The final idea is that we can apply Huffman coding: use more bits for the infrequently appearing weights and fewer bits for the frequently appearing weights. By combining these three methods, pruning, weight sharing, and Huffman coding, we can compress state-of-the-art neural networks by 10x to 49x without hurting the prediction accuracy. Sometimes it's even a little better, but maybe that is noise.
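For concreteness, here is a compact Python sketch (standard library only, my own illustration) of Huffman coding over quantized weight indices; the frequencies are made up, and a deployed coder would also need the decoding side.

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build Huffman codes: frequent symbols get shorter codes."""
    freq = Counter(symbols)
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)                      # tiebreaker so dicts are never compared
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

# Quantized weight indices: index 0 is far more common than the rest.
indices = [0] * 70 + [1] * 15 + [2] * 10 + [3] * 5
print(huffman_codes(indices))  # index 0 gets a 1-bit code, the rare ones longer
```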
So the next question is: these are just pre-trained models from, say, Google or Microsoft. Can we have a compact model to begin with, even before such compression?

SqueezeNet: you may already have worked with this neural network model in a homework. The idea is that a squeeze layer here shrinks the number of channels feeding the 3x3 convolutions; that's where "squeeze" comes from. And there are two branches here, rather than four branches as in the Inception model. As a result, the model is extremely compact. It doesn't have any fully connected layers; everything is convolutional, and the last layer is a global pooling.

So what if we apply the deep compression algorithm to such an already compact model: will it get even smaller? This is AlexNet after compression, and this is SqueezeNet. Even before compression it's 50x smaller than AlexNet with the same accuracy; after compression it is 510x smaller, still with the same accuracy, at less than half a megabyte. This means it's very easy to fit such a small model in on-chip SRAM, which is literally tens of megabytes.

So what does that mean? It's possible to achieve a speedup. This is the speedup I measured, on the fully connected layers only for now, on a CPU, a GPU, and a mobile GPU, before and after pruning the weights. On average I observed about a 3x speedup on the CPU, about 3x on the GPU, and roughly 5x on the mobile GPU, a TK1. And the same goes for energy efficiency: an average improvement of 3x to 6x on the CPU, GPU, and mobile GPU. These ideas are used at these companies.

Having talked about pruning and weight sharing, which is a non-linear quantization method, we're now going to talk about quantization, which is what the TPU design uses. The TPU uses only eight bits for inference, and the reason it can is quantization. Let's see how it works. Quantization comes with this complicated figure, but the intuition is very simple. You train the neural network with normal floating-point numbers. Then you quantize the weights and activations by gathering statistics for each layer: for example, the maximum number, the minimum number, and how many bits are enough to represent that dynamic range. Then you use that many bits for the integer part, and the rest of the eight bits for the fractional part of the representation. You can also fine-tune in floating-point format, or do the feed-forward pass in fixed point and the back-propagation update in floating point. There are lots of different ideas for getting better accuracy.
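Here is a small NumPy sketch of that per-layer calibration; the power-of-two scaling and the saturation bounds are my illustration of the fixed-point scheme described above, not code from the talk.

```python
import numpy as np

def calibrate_frac_bits(samples, num_bits=8):
    """Per-layer statistics: how many of the 8 bits can be fractional?"""
    max_abs = np.abs(samples).max()                     # observed dynamic range
    int_bits = int(np.ceil(np.log2(max_abs + 1e-12)))   # bits for the integer part
    return (num_bits - 1) - int_bits                    # the rest, minus the sign bit

def quantize(x, frac_bits, num_bits=8):
    """Round to fixed point with `frac_bits` fractional bits, saturating to int8."""
    scaled = np.round(x * 2.0 ** frac_bits)
    lo, hi = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return np.clip(scaled, lo, hi).astype(np.int8)

acts = np.random.randn(1000) * 3.0            # pretend activations of one layer
fb = calibrate_frac_bits(acts)
q = quantize(acts, fb)
restored = q.astype(np.float32) / 2.0 ** fb   # dequantize for comparison
print("frac bits:", fb, "max error:", np.abs(acts - restored).max())
```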
And this is the result, the number of bits versus the accuracy. Using fixed-point 8 bit, the accuracy for GoogLeNet doesn't drop significantly, and for VGG-16 the accuracy also holds up pretty well, while going down to six bits, the accuracy begins to drop pretty dramatically.

Next idea: low-rank approximation. It turns out that you can break one convolutional layer into two: one convolution here, followed by a one-by-one convolution. It's like breaking one complicated problem into two smaller ones. For convolutional layers this achieves about a 2x speedup with almost no loss of accuracy, and about a 5x speedup with roughly a 6% loss of accuracy. The same works for fully connected layers: the simplest idea is to use the SVD to break one matrix into two matrices. Following this idea, this paper proposes the Tensor Train decomposition, breaking one fully connected layer down into a long chain of small factors, which is where the name comes from.
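Here is a minimal NumPy sketch of the SVD idea for a fully connected layer; the layer sizes and the rank are made up for illustration, and a real trained weight matrix is far closer to low-rank than this random one, so the error shown here is pessimistic.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Split one fully connected layer W into two thin layers via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]     # first factor:  (out, rank)
    B = Vt[:rank, :]               # second factor: (rank, in)
    return A, B

W = np.random.randn(1024, 4096)    # original layer: ~4.2M weights
A, B = low_rank_factorize(W, rank=128)
# The two factors hold 1024*128 + 128*4096 ~= 0.66M weights, ~6x fewer,
# and W @ x is approximated by two cheaper matmuls: A @ (B @ x).
x = np.random.randn(4096)
err = np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x)
print(f"relative error at rank 128: {err:.2f}")
```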
Going even more crazy: can we use only two or three distinct values to represent a neural network, binary or ternary weights? We have already seen the weight distribution after pruning: some positive weights and some negative weights. Can we use just three numbers, one, minus one, and zero, to represent the network? This is our recent paper, where we maintain full-precision weights during training time, but at inference time we keep only the scaling factors and the ternary weights. So during inference we need only three values, which is very efficient and makes the model very small. This plot shows the proportions of the positive, zero, and negative weights: they can change during training, and so can their absolute values, the scaling factors.

And this is the visualization of kernels learned by this trained ternary quantization. We can see that some of them are corner detectors, like this one here and this one, and some of them are maybe edge detectors, like this filter. Actually, we don't need such fine-grained resolution; just three values are enough.

This is the validation accuracy on ImageNet with AlexNet. The dashed line is the baseline accuracy with 32-bit floating point, and the red line is our result: it converges to pretty much the same accuracy as the full-precision weights.
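Below is a NumPy sketch of threshold-based ternarization in the spirit of that paper; the threshold heuristic and the per-sign scaling factors set from mean magnitudes are my simplifications (the paper learns the scaling factors by back-propagation).

```python
import numpy as np

def ternarize(w, t):
    """Quantize full-precision weights to {-Wn, 0, +Wp} for inference."""
    mask_p, mask_n = w > t, w < -t
    Wp = w[mask_p].mean() if mask_p.any() else 0.0    # positive scaling factor
    Wn = -w[mask_n].mean() if mask_n.any() else 0.0   # negative scaling factor
    symbols = np.zeros_like(w, dtype=np.int8)         # 2-bit symbols...
    symbols[mask_p] = 1
    symbols[mask_n] = -1
    return symbols, Wp, Wn                            # ...plus two floats per layer

w = np.random.randn(3, 3, 64)                         # a stack of conv kernels
symbols, Wp, Wn = ternarize(w, t=0.7 * np.abs(w).mean())  # assumed threshold rule
w_inference = np.where(symbols == 1, Wp, np.where(symbols == -1, -Wn, 0.0))
print("ternary values used:", {-Wn, 0.0, Wp})
```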
Last idea: the Winograd transformation. This is about how we implement convolutions in deep neural nets. This slide, credited to Julien, a friend from NVIDIA, shows the conventional, direct convolution implementation. Originally, we just take a dot product between the nine elements of the filter and nine elements of the image, and sum over the channels, so for every output we need 9 times C multiplies and adds, where C is the number of channels.

Winograd convolution is another, equivalent method; it is not lossy. It was first proposed in the paper "Fast Algorithms for Convolutional Neural Networks." Instead of sliding the convolution directly, one step at a time, it first transforms the input feature map into another tile, using transform coefficients that are only values like 1, 0.5, and 2, which can be implemented efficiently with shifts, and it also transforms the filter into a four-by-four tile. Then we take an element-wise product and sum it over the channels, so only 16 multiplications happen here, and finally an inverse transform produces four outputs. The transform and inverse transform can be amortized, and their multiplications can be ignored. So to get four outputs, direct convolution needs 9 x C x 4 = 36 x C multiplications, but now we need only 16 x C. That is 2.25x fewer multiplications to compute exactly the same convolution.

And here is the speedup: theoretically 2.25x, and in practice, on VGG I believe, roughly 1.7x to 2x. Pretty significant. Starting with cuDNN 5, cuDNN uses this Winograd convolution algorithm.
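To make the transforms concrete, here is a NumPy sketch of a single F(2x2, 3x3) tile using the standard transform matrices from that paper; a real library also tiles the image, batches over channels, and amortizes the transforms, which this sketch does not.

```python
import numpy as np

# Winograd F(2x2, 3x3) transform matrices; entries are only 0, +-1, +-0.5.
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], float)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

def winograd_tile(d, g):
    """A 2x2 output tile of a 3x3 convolution over a 4x4 input tile."""
    V = B_T @ d @ B_T.T        # transform the 4x4 input tile
    U = G @ g @ G.T            # transform the 3x3 filter to 4x4
    M = U * V                  # only 16 multiplications here
    return A_T @ M @ A_T.T     # inverse transform -> 2x2 outputs

d, g = np.random.randn(4, 4), np.random.randn(3, 3)
# Direct method for comparison: 9 multiplies per output, 36 in total.
direct = np.array([[(d[i:i+3, j:j+3] * g).sum() for j in range(2)]
                   for i in range(2)])
print(np.allclose(winograd_tile(d, g), direct))  # True: same result, 16 vs 36 muls
```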
Okay, so far we have covered the efficient algorithms for efficient inference: pruning, weight sharing, quantization, Winograd, and binary and ternary weights. Now let's see what the optimal hardware for efficient inference looks like, and what the Google TPU is.

There is a wide range of domain-specific architectures, ASICs, for deep neural networks. Their common goal is to minimize memory access to save power. For example, Eyeriss from MIT uses the row-stationary dataflow to minimize off-chip data access, and DaDianNao from the Chinese Academy of Sciences buffers all the weights in on-chip eDRAM instead of having to go off-chip. The TPU from Google uses 8-bit integers to represent the numbers, and at Stanford I proposed the EIE architecture, which supports inference directly on compressed, sparse neural networks.

This is what the TPU looks like. Rather cleverly, it fits into a disk-drive slot, up to four cards per server. And this is the high-level architecture of the Google TPU. Don't be overwhelmed: the kernel part is this giant matrix multiplication unit, a 256-by-256 array, so in a single cycle it performs 64K multiply-accumulate operations. Running at 700 MHz, counting a multiply and an add as two operations, that's 65,536 x 2 x 0.7 GHz, about 92 teraops per second; it can be that high because these are integer operations. That's about 25x a GPU and more than 100x a CPU.

And notice that the TPU has a really large software-managed on-chip buffer: 24 megabytes. An L3 cache on a CPU is already considered large at 16 megabytes; this is 24 megabytes, which is pretty large. And it is fed by two DDR3 DRAM channels.
That part is a little weak, though, because the memory bandwidth is only 30 gigabytes per second, compared with recent GPUs using HBM at 900 gigabytes per second. DDR4 was released in 2014, so this makes sense: the design dates from that time and used DDR3. But with DDR4, or even high-bandwidth memory, the performance could be boosted further.

This is a comparison of Google's TPU with the CPU and the GPU (a K80 GPU, by the way). The die area is much smaller, about half the size of the CPU and GPU dies, and the power consumption is roughly 75 watts. And look at this number: the peak throughput is much higher than the CPU's and GPU's, about 90 teraops per second, which is pretty high.

So here is the workload; thanks to David for sharing the slide. This is the workload at Google, on which they benchmarked these TPUs. It's a little interesting that convolutional neural nets account for only 5% of the data-center workload. Most of it, about 61%, is multilayer perceptrons, those fully connected layers, maybe for ads, I'm not sure. And about 29% of the data-center workload is long short-term memory, for example speech recognition or machine translation, I suspect.

Remember, we just saw 90 teraops per second of peak. But how many teraops per second are actually achieved? The roofline model is the basic tool for measuring the bottleneck of a computer system: whether you are bottlenecked by arithmetic or by memory bandwidth. It's like a bucket: the lowest stave determines how much water the bucket can hold. The x-axis is the arithmetic intensity, the number of operations per byte fetched from memory, the ratio between computation and memory traffic. The y-axis is the actual attainable performance, and the flat line is the peak performance. When you fetch a single piece of data and then do a lot of operations on it, you are bottlenecked by arithmetic. But when you fetch a lot of data from memory and do only a tiny bit of arithmetic on it, you are in the sloped region, bottlenecked by memory bandwidth: how fast you can fetch from memory determines how much real performance you get. And remember the ratio: at an arithmetic intensity of one, the attainable performance is just the memory bandwidth of your system, and the turning point sits where that bandwidth-limited slope meets the peak.
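Here is a tiny sketch of the roofline formula; the TPU-like peak and bandwidth numbers are the ones quoted in this talk, and the intensities are made up to show both regimes.

```python
import numpy as np

def attainable_ops(intensity, peak_ops, mem_bw):
    """Roofline: performance is capped by either compute or memory bandwidth.

    intensity: arithmetic intensity in ops per byte fetched from memory.
    peak_ops:  peak throughput in ops/s; mem_bw: memory bandwidth in bytes/s.
    """
    return np.minimum(peak_ops, mem_bw * intensity)

peak, bw = 92e12, 30e9                  # ~92 Tops/s peak, ~30 GB/s DDR3
for intensity in (10, 100, 1000, 3067):  # ops per byte
    print(intensity, f"{attainable_ops(intensity, peak, bw) / 1e12:.1f} Tops/s")
# Below peak/bw ~= 3067 ops/byte the machine is memory-bound, which is why the
# low-reuse fully connected and LSTM workloads sit far under the 92 Tops/s roof.
```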
So let's see what life is like for the TPU. The TPU's peak performance is really high, about 90 teraops per second, and the convolutional nets pretty much saturate it. But there are a lot of neural networks with a utilization below 10%, meaning the nominal 90 teraops per second comes out to about 3 to 12 teraops per second in the real case. Why is that? The reason is that in order to guarantee real-time responses, so that users don't wait too long, you cannot batch many users' images or speech data together. As a result, the fully connected layers have very little reuse, so they are bottlenecked by the memory bandwidth. Among the convolutional nets, this blue one, CNN0, which achieves 86 teraops per second, has the highest ratio of operations to bytes fetched, more than 2,000, while for the multilayer perceptrons and the long short-term memories the ratio is pretty low.

This figure compares the rooflines: this is the TPU, this one the CPU, and this the GPU. Here is the peak memory bandwidth, which you can read off at an intensity ratio of one, and the TPU's roofline sits highest. And here is where these neural networks lie on the curves; the asterisks are the TPU's workloads. They are still higher than the other dots, but if you're not comfortable with this log-scale figure, this is what it looks like as a linear roofline: pretty much everything disappears except the TPU results. Still, all these points, although higher than the CPU's and GPU's, are way below the theoretical peak operations per second.

As I mentioned before, it is really bottlenecked by the low-latency requirement, which prevents a large batch size; that's why the operations per byte are low. And how do you solve this problem? You want a smaller memory footprint, so that the memory bandwidth requirement goes down. One solution is to compress the model, and the challenge is: how do we build hardware that can run inference directly on the compressed model?

So I'm going to introduce my design, EIE, the Efficient Inference Engine, which works directly on the sparse, compressed model to save memory bandwidth. The rules of thumb, as we mentioned before: first, exploit sparsity, since anything times zero is zero, so don't store it and don't compute on it; second, you don't need full precision, you can approximate. By taking advantage of the sparse weights, we get about a 10x saving in computation and 5x less memory footprint; the 2x difference is due to the index overhead. By taking advantage of the sparse activations, meaning that after ReLU, a zero activation is simply ignored, we save another 3x of computation. And with the weight-sharing mechanism, we can use four bits per weight rather than 32 bits, which is another 8x saving in memory footprint.

This is how the weights are stored logically, a four-by-eight matrix, and this is how they are stored physically: only the non-zero weights are stored. You don't store the zeros, and you save the bandwidth of fetching them. I also use relative indices to further reduce the index storage overhead.
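Here is a NumPy sketch of those two rules of thumb applied to a sparse matrix-vector product; note that EIE itself packs 4-bit relative row indices and 4-bit shared-weight indices, while this sketch uses absolute indices and full-precision values for clarity.

```python
import numpy as np

def compress_columns(W):
    """Keep only the non-zero weights, stored column by column."""
    cols = []
    for j in range(W.shape[1]):
        rows = np.flatnonzero(W[:, j])
        cols.append((rows, W[rows, j]))
    return cols

def sparse_matvec(columns, a, out_dim):
    """y = W @ a, skipping zero activations and zero weights entirely."""
    y = np.zeros(out_dim)
    for j in np.flatnonzero(a):        # anything times zero is zero:
        rows, vals = columns[j]        # don't fetch it, don't compute on it
        y[rows] += vals * a[j]
    return y

W = np.random.randn(8, 8) * (np.random.rand(8, 8) > 0.9)  # ~90% sparse weights
a = np.maximum(np.random.randn(8), 0)                     # ReLU: sparse activations
print(np.allclose(W @ a, sparse_matvec(compress_columns(W), a, out_dim=8)))  # True
```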
In the computation, as this figure shows, we run the multiplication only on the non-zeros. Each non-zero activation is broadcast; if an activation is zero, we skip it, and if a weight is zero, we skip that too. Only when both are non-zero do we do the multiplication, one per cycle. The idea, again, is that anything multiplied by zero is zero.

This part is a little complicated, so I'll go quickly: a lookup table decodes the 4-bit weight index into the 16-bit weight, and the 4-bit relative index passes through an address accumulator to produce the 16-bit absolute address. And this is what the hardware architecture looks like at a high level; feel free to refer to my paper for the details.

Okay, speedup. Using such an efficient hardware architecture together with model compression: this is the original result we saw for the CPU, GPU, and mobile GPU, and now EIE is here, 189 times faster than the CPU and about 13 times faster than the GPU. And this is the energy efficiency, on a log scale: about 24,000x more energy-efficient than a CPU and about 3,000x more than a GPU. That means, for example, that if your battery previously lasted one hour, it could now last 3,000 hours.

And if you say an ASIC is always better than CPUs and GPUs because it's customized hardware: this compares EIE with its peer ASICs, for example DaDianNao and TrueNorth. It has better throughput and better energy efficiency by an order of magnitude compared with the other ASICs, not to mention the CPU, GPU, and FPGAs.

So we have covered half of the journey: pretty much everything for inference. Now we're going to switch gears and talk about training. How do we train neural networks efficiently, and how do we train faster?
So again, we start with the algorithms first, efficient training algorithms, followed by the hardware for efficient training.

For efficient training algorithms, I'm going to mention four topics. The first is parallelization; then mixed-precision training, which was presented just about a month ago at NVIDIA's GTC, so it's fresh knowledge; then model distillation; and finally my work on dense-sparse-dense training, a better regularization technique.

Let's start with parallelization. This figure is one that everyone in the hardware community is very familiar with. As time goes by, what is the trend? The number of transistors keeps increasing, but single-threaded performance has plateaued in recent years, and the frequency has plateaued too: because of the power constraint, frequency scaling stopped. The interesting thing is that the number of cores keeps increasing. So what we really need to do is parallelize: how do we parallelize the problem to take advantage of parallel processing?

Actually, there are a lot of opportunities for parallelism in deep neural networks. For example, we can do data parallelism: feeding two images into the same model and running them at the same time. This doesn't reduce the latency of a single input, but it makes the effective batch size larger: with four machines, your effective batch size becomes four times what it was. It does require coordinated weight updates. For example, in this paper from Google, a parameter server acts as the master, with a number of workers each running on its own slice of the training data; they send their gradients up to the parameter server and individually receive the updated weights back. That's how data parallelism is handled.
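Here is a toy NumPy sketch of that parameter-server pattern; the gradient function is a hypothetical stand-in for backprop, and a real system would run the workers on separate machines rather than in a Python loop.

```python
import numpy as np

def worker_gradient(w, shard):
    """Hypothetical stand-in for one worker's backprop on its data shard."""
    return 0.01 * w + shard.mean()

w_server = np.random.randn(1000)       # the parameter server's master weights
shards = np.random.randn(4, 32, 10)    # the batch, split across 4 workers

for step in range(100):
    # Each worker pulls the current weights and computes a gradient on its shard.
    grads = [worker_gradient(w_server, shard) for shard in shards]
    # The server reduces (averages) the gradients and applies one coordinated
    # update; with 4 workers the effective batch size is 4x a single worker's.
    w_server -= 0.1 * np.mean(grads, axis=0)
```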
765 00:45:21,383 --> 00:45:25,543 For example, there's an image, and you want to run convolution 766 00:45:25,543 --> 00:45:29,293 on this image; that is a six-dimensional for loop. 767 00:45:30,530 --> 00:45:35,271 What you can do is cut the input image into 768 00:45:35,271 --> 00:45:39,482 two-by-two blocks so that each thread, or each processor, 769 00:45:39,482 --> 00:45:42,619 handles one fourth of the image, 770 00:45:42,619 --> 00:45:45,580 although there's a small halo in between that you 771 00:45:45,580 --> 00:45:47,330 have to take care of (see the sketch after this passage). 772 00:45:48,260 --> 00:45:50,860 And also, you can parallelize by the 773 00:45:50,860 --> 00:45:53,193 output or input feature maps. 774 00:45:54,730 --> 00:45:56,911 And for those fully connected layers, 775 00:45:56,911 --> 00:45:58,500 how do we parallelize the model? 776 00:45:58,500 --> 00:45:59,442 It's even simpler. 777 00:45:59,442 --> 00:46:02,420 You can cut the model in half 778 00:46:02,420 --> 00:46:05,337 and hand it to different threads. 779 00:46:06,551 --> 00:46:07,991 And the third idea: you can even do 780 00:46:07,991 --> 00:46:09,378 hyper-parameter parallelism. 781 00:46:09,378 --> 00:46:11,762 For example, you can tune your learning rate and your 782 00:46:11,762 --> 00:46:14,402 weight decay on different machines; 783 00:46:14,402 --> 00:46:16,400 that's coarse-grained parallelism. 784 00:46:16,400 --> 00:46:20,780 So there are so many alternatives you have to tune. 785 00:46:20,780 --> 00:46:23,631 A small summary of parallelism. 786 00:46:23,631 --> 00:46:27,031 There is lots of parallelism in deep neural networks. 787 00:46:27,031 --> 00:46:30,271 For example, with data parallelism, you can run multiple 788 00:46:30,271 --> 00:46:34,820 training images, but you cannot have an unlimited number 789 00:46:34,820 --> 00:46:38,940 of processors, because you are limited by the batch size. 790 00:46:38,940 --> 00:46:42,068 If it's too large, stochastic gradient descent 791 00:46:42,068 --> 00:46:44,438 becomes gradient descent, and that's not good. 792 00:46:44,438 --> 00:46:47,277 You can also use model parallelism: 793 00:46:47,277 --> 00:46:50,466 split the model, either by cutting the image or by 794 00:46:50,466 --> 00:46:53,133 cutting the convolution weights, 795 00:46:58,598 --> 00:47:01,223 either cutting the image or cutting 796 00:47:01,223 --> 00:47:03,940 the fully connected layers. 797 00:47:03,940 --> 00:47:08,319 So it's very easy to get 16 to 64 GPUs training one model 798 00:47:08,319 --> 00:47:10,490 in parallel with very good speedup, 799 00:47:10,490 --> 00:47:12,323 almost linear speedup. 800 00:47:13,810 --> 00:47:17,988 Okay, the next interesting thing: mixed precision with 801 00:47:17,988 --> 00:47:19,071 FP16 and FP32. 802 00:47:21,319 --> 00:47:23,370 So remember, in the beginning of this lecture, 803 00:47:23,370 --> 00:47:28,207 I had a chart showing the energy and area overhead of 804 00:47:28,207 --> 00:47:30,290 16-bit versus 32-bit arithmetic. 805 00:47:31,887 --> 00:47:36,054 Going from 32 bit to 16 bit, you save about 4x the energy 806 00:47:37,890 --> 00:47:39,223 and 4x the area. 807 00:47:40,528 --> 00:47:43,340 So can we train a deep neural network with such low 808 00:47:43,340 --> 00:47:47,831 precision, with 16-bit floating point rather than 32-bit? 809 00:47:47,831 --> 00:47:50,998 It turns out we can do that, partially. 810 00:47:53,498 --> 00:47:58,250 By partially, I mean we need FP32 in some places. 811 00:47:58,250 --> 00:48:01,090 And where are those places?
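Before answering that, here is the sketch of the 2x2 model-parallel split mentioned above: a toy NumPy/SciPy example, assuming a 3x3 kernel and a one-pixel halo, that checks the stitched result against the unsplit convolution. The tile sizes and image are made up for illustration.

```python
import numpy as np
from scipy.signal import convolve2d

img = np.random.rand(8, 8)
k = np.random.rand(3, 3)
full = convolve2d(img, k, mode="same")   # reference: one worker does it all

h, w, halo = 4, 4, 1                     # quadrant size and halo width
out = np.zeros_like(img)
for r in (0, 1):
    for c in (0, 1):
        r0, c0 = r * h, c * w
        # slice this worker's quadrant plus a halo, clipped at the image border
        tile = img[max(r0 - halo, 0):r0 + h + halo,
                   max(c0 - halo, 0):c0 + w + halo]
        conv = convolve2d(tile, k, mode="same")
        # drop the halo rows/columns before writing the quadrant back
        dr, dc = r0 - max(r0 - halo, 0), c0 - max(c0 - halo, 0)
        out[r0:r0 + h, c0:c0 + w] = conv[dr:dr + h, dc:dc + w]

assert np.allclose(out, full)            # same result as the unsplit convolution
```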
812 00:48:01,090 --> 00:48:05,257 So we can do the multiplication with 16-bit inputs, 813 00:48:07,951 --> 00:48:11,476 and then we have to do the summation 814 00:48:11,476 --> 00:48:13,879 with 32-bit accumulation, 815 00:48:13,879 --> 00:48:18,860 and then convert the result to 32 bit to store the weights. 816 00:48:18,860 --> 00:48:22,777 So that's where the mixed precision comes from. 817 00:48:25,108 --> 00:48:28,140 So for example, we have master weights stored in 818 00:48:28,140 --> 00:48:31,932 floating point 32; we down-convert them to floating 819 00:48:31,932 --> 00:48:36,099 point 16, and then we do the feed-forward with 16-bit 820 00:48:37,612 --> 00:48:42,290 weights and 16-bit activations; we get a 16-bit activation 821 00:48:42,290 --> 00:48:46,522 here at the end. When we are doing back-propagation, 822 00:48:46,522 --> 00:48:50,689 the computation is also done with 16-bit floating point. 823 00:48:52,700 --> 00:48:57,351 Very interestingly, for the weights we get a floating 824 00:48:57,351 --> 00:49:00,851 point 16-bit gradient for the weights. 825 00:49:03,255 --> 00:49:07,422 But when we are doing the update, W minus the learning 826 00:49:09,598 --> 00:49:13,154 rate times the gradient, that operation has 827 00:49:13,154 --> 00:49:14,904 to be done in 32 bit. 828 00:49:17,740 --> 00:49:20,943 That's where the mixed precision is coming from. 829 00:49:20,943 --> 00:49:24,692 And you can see there are two colors: here is 16 bit, 830 00:49:24,692 --> 00:49:26,514 and here is 32 bit. 831 00:49:26,514 --> 00:49:30,181 That's where the mixed precision comes from (see the sketch below). 832 00:49:31,284 --> 00:49:36,212 So does such low precision sacrifice the prediction 833 00:49:36,212 --> 00:49:38,884 accuracy of your model? 834 00:49:38,884 --> 00:49:43,051 So this is a figure from NVIDIA, released just a couple 835 00:49:43,914 --> 00:49:45,747 of weeks ago, actually. 836 00:49:46,652 --> 00:49:49,819 Thanks to Paulius for giving me the slide. 837 00:49:51,431 --> 00:49:55,751 The convergence of floating point 32 versus 838 00:49:55,751 --> 00:49:58,500 the tensor ops, which is basically the mixed 839 00:49:58,500 --> 00:50:00,842 precision training, is actually pretty much 840 00:50:00,842 --> 00:50:02,932 the same. 841 00:50:02,932 --> 00:50:04,762 If you zoom in a little bit, 842 00:50:04,762 --> 00:50:06,690 they are pretty much the same. 843 00:50:06,690 --> 00:50:11,052 And for ResNet, the mixed precision sometimes behaves 844 00:50:11,052 --> 00:50:14,771 a little better than the full-precision weights, 845 00:50:14,771 --> 00:50:17,234 maybe because of the noise. 846 00:50:17,234 --> 00:50:20,582 But in the end, after you train the model (this is 847 00:50:20,582 --> 00:50:24,762 the result for AlexNet, Inception V3, and ResNet-50 848 00:50:24,762 --> 00:50:28,679 with FP32 versus FP16 mixed precision training), 849 00:50:29,881 --> 00:50:32,721 the accuracy is pretty much the same 850 00:50:32,721 --> 00:50:33,962 for these two methods. 851 00:50:33,962 --> 00:50:37,295 A little bit worse, but not by too much. 852 00:50:40,042 --> 00:50:43,714 So having talked about mixed precision training, 853 00:50:43,714 --> 00:50:47,881 the next idea is to train with model distillation. 854 00:50:49,703 --> 00:50:52,412 For example, you can have multiple neural networks: 855 00:50:52,412 --> 00:50:55,863 GoogLeNet, VGGNet, and ResNet, for example.
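Here is the sketch referred to above: one mixed-precision training step on a toy linear model, in hypothetical NumPy. FP32 master weights, FP16 forward and backward passes, FP32 update. (The published recipe also adds loss scaling to keep small FP16 gradients from underflowing, which this sketch omits.)

```python
import numpy as np

rng = np.random.default_rng(0)
w_master = rng.normal(size=(4, 2)).astype(np.float32)  # FP32 master weights
x = rng.normal(size=(8, 4)).astype(np.float16)
y = rng.normal(size=(8, 2)).astype(np.float16)
lr = np.float32(0.01)

for step in range(100):
    w16 = w_master.astype(np.float16)   # down-convert the weights to FP16
    out = x @ w16                       # FP16 forward pass
    grad = x.T @ (out - y)              # FP16 backward pass: the weight gradient
    # the update W <- W - lr * grad is done in FP32 against the master copy:
    w_master -= lr * grad.astype(np.float32)
```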
856 00:50:55,863 --> 00:51:00,030 And the question is, can we take advantage of these 857 00:51:00,943 --> 00:51:02,092 different models? 858 00:51:02,092 --> 00:51:05,132 Of course we can do a model ensemble, but can we utilize them 859 00:51:05,132 --> 00:51:09,299 as teachers, to teach a small junior neural network to have 860 00:51:11,201 --> 00:51:15,434 it perform as well as the senior neural networks? 861 00:51:15,434 --> 00:51:17,090 So this is the idea. 862 00:51:17,090 --> 00:51:21,257 You have multiple large, powerful senior neural networks 863 00:51:23,314 --> 00:51:25,202 to teach this student model, 864 00:51:25,202 --> 00:51:28,881 and hopefully it can get better results. 865 00:51:28,881 --> 00:51:32,372 And the idea to do that is, instead of using the 866 00:51:32,372 --> 00:51:37,162 hard label (for example, for car, dog, cat, the probability 867 00:51:37,162 --> 00:51:41,329 for dog is 100%), the output of the geometric 868 00:51:42,383 --> 00:51:46,063 ensemble of those large teacher neural networks 869 00:51:46,063 --> 00:51:50,230 may say the dog has 90% and the cat about 10%, 870 00:51:53,282 --> 00:51:55,492 and the magic happens here. 871 00:51:55,492 --> 00:51:59,071 You want to have a softened label here. 872 00:51:59,071 --> 00:52:03,071 For example, the dog is 30% and the cat is 20%. 873 00:52:03,071 --> 00:52:05,471 Still, the dog is higher than the cat, 874 00:52:05,471 --> 00:52:09,260 so the prediction is still correct, but it uses 875 00:52:09,260 --> 00:52:13,427 this soft label to train the student neural network 876 00:52:15,431 --> 00:52:19,460 rather than using the hard label to train 877 00:52:19,460 --> 00:52:21,991 the student neural network. 878 00:52:21,991 --> 00:52:26,740 And mathematically, you control how much you soften 879 00:52:26,740 --> 00:52:30,482 it with a temperature in the softmax, 880 00:52:30,482 --> 00:52:33,149 controlled by this temperature (see the sketch below). 881 00:52:34,322 --> 00:52:36,751 And the result is that, starting with a trained model 882 00:52:36,751 --> 00:52:40,918 that classifies 58.9% of the test frames correctly, 883 00:52:43,099 --> 00:52:46,099 the new model converges to 57%, 884 00:52:47,340 --> 00:52:50,173 trained on only 3% of the data. 885 00:52:52,699 --> 00:52:54,882 So that's the magic of model distillation, 886 00:52:54,882 --> 00:52:56,715 using these soft labels. 887 00:52:59,191 --> 00:53:02,460 And the last idea is from my recent paper, using 888 00:53:02,460 --> 00:53:06,242 better regularization to train deep neural nets. 889 00:53:06,242 --> 00:53:07,908 We have seen these two figures before. 890 00:53:07,908 --> 00:53:09,929 We pruned the neural network, so it has a smaller number 891 00:53:09,929 --> 00:53:12,300 of weights but the same accuracy. 892 00:53:12,300 --> 00:53:15,439 Now what I did is recover and retrain those 893 00:53:15,439 --> 00:53:18,271 weights, shown in red, and train everything 894 00:53:18,271 --> 00:53:21,625 together to increase the model capacity, after 895 00:53:21,625 --> 00:53:24,887 it has been trained in a low-dimensional space. 896 00:53:24,887 --> 00:53:27,528 It's like you learn the trunk first and then gradually 897 00:53:27,528 --> 00:53:31,071 add the leaves and learn everything together. 898 00:53:31,071 --> 00:53:35,238 It turns out that on ImageNet this gives about 1% to 899 00:53:37,471 --> 00:53:41,020 4% absolute improvement in accuracy.
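Here is the temperature sketch referred to above, before continuing with Dense-Sparse-Dense: a minimal NumPy example of softening teacher logits. The three-class logit values are made up for illustration.

```python
import numpy as np

def softmax(z, T=1.0):
    # dividing the logits by a temperature T > 1 spreads probability
    # mass onto the "wrong" classes before normalizing
    z = np.asarray(z, dtype=np.float64) / T
    e = np.exp(z - z.max())        # subtract the max for numerical stability
    return e / e.sum()

teacher_logits = np.array([9.0, 7.0, 1.0])   # classes: dog, cat, car

print(softmax(teacher_logits, T=1))   # hard-ish: dog ~0.88, cat ~0.12
print(softmax(teacher_logits, T=4))   # softened: dog ~0.57, cat ~0.35;
                                      # the dog is still highest
```

The student is then trained against these softened targets, usually mixed with the true hard labels.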
900 00:53:41,020 --> 00:53:44,998 And it's also general-purpose: it works on long short-term memory 901 00:53:44,998 --> 00:53:49,330 and also recurrent neural nets, in collaboration with Baidu. 902 00:53:49,330 --> 00:53:52,610 So I also open-sourced these specially trained models 903 00:53:52,610 --> 00:53:56,460 in the DSD Model Zoo, where all of 904 00:53:56,460 --> 00:54:00,490 these trained models are available: GoogLeNet, VGG, ResNet, SqueezeNet, 905 00:54:00,490 --> 00:54:01,969 and also AlexNet. 906 00:54:01,969 --> 00:54:05,099 So if you are interested, feel free to check out this 907 00:54:05,099 --> 00:54:09,182 Model Zoo and compare it with the Caffe Model Zoo. 908 00:54:11,010 --> 00:54:14,998 Here are some examples of how dense-sparse-dense training helps 909 00:54:14,998 --> 00:54:16,581 with image captioning. 910 00:54:17,878 --> 00:54:21,396 For example, this is a very challenging figure. 911 00:54:21,396 --> 00:54:24,087 The original NeuralTalk baseline says a boy in 912 00:54:24,087 --> 00:54:27,318 a red shirt is climbing a rock wall. 913 00:54:27,318 --> 00:54:29,179 And the sparse model says a young girl is jumping 914 00:54:29,179 --> 00:54:31,849 off a tree, probably mistaking the hair for either 915 00:54:31,849 --> 00:54:33,729 the rock or the tree. 916 00:54:33,729 --> 00:54:36,278 But the dense-sparse-dense training, by using this kind of 917 00:54:36,278 --> 00:54:39,100 regularization in a low-dimensional space, says 918 00:54:39,100 --> 00:54:42,958 a young girl in a pink shirt is swinging on a swing. 919 00:54:42,958 --> 00:54:47,070 And there are a lot more examples; due to the limit of time, 920 00:54:47,070 --> 00:54:49,129 I will not go over them one by one. 921 00:54:49,129 --> 00:54:51,150 For example, "a group of people are standing in front 922 00:54:51,150 --> 00:54:53,118 of a building," but there's no building. 923 00:54:53,118 --> 00:54:55,630 "A group of people are walking in the park." 924 00:54:55,630 --> 00:54:58,550 Feel free to check out the paper and see more interesting 925 00:54:58,550 --> 00:54:59,383 results. 926 00:55:01,420 --> 00:55:05,587 Okay, finally, we come to hardware for efficient training. 927 00:55:06,478 --> 00:55:08,929 How do we take advantage of the algorithms 928 00:55:08,929 --> 00:55:10,089 we just mentioned, 929 00:55:10,089 --> 00:55:14,060 for example, parallelism and mixed precision? How is 930 00:55:14,060 --> 00:55:16,630 the hardware designed to actually 931 00:55:16,630 --> 00:55:19,297 take advantage of such features? 932 00:55:21,958 --> 00:55:26,041 First, GPUs. This is the Nvidia Pascal GPU, GP100, 933 00:55:28,950 --> 00:55:31,367 which was released last year. 934 00:55:32,289 --> 00:55:35,789 It supports up to 20 teraflops of FP16. 935 00:55:38,048 --> 00:55:40,849 It has 16 gigabytes of high-bandwidth memory, 936 00:55:40,849 --> 00:55:42,932 at 750 gigabytes per second. 937 00:55:46,060 --> 00:55:49,430 So remember, computation and memory bandwidth are 938 00:55:49,430 --> 00:55:53,350 the two factors that determine your overall performance. 939 00:55:53,350 --> 00:55:57,041 Whichever is lower, performance will suffer (see the roofline sketch below). 940 00:55:57,041 --> 00:56:01,124 So this is a really high bandwidth, 750 gigabytes per second, 941 00:56:02,209 --> 00:56:06,376 compared with DDR3 at just 10 or 30 gigabytes per second. 942 00:56:08,189 --> 00:56:10,022 It consumes 300 watts, 943 00:56:14,147 --> 00:56:17,278 it's built in a 16-nanometer process, 944 00:56:17,278 --> 00:56:20,945 and it has a 160-gigabyte-per-second NVLink.
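The roofline sketch referred to above: a back-of-the-envelope Python calculation of how whichever of compute and memory bandwidth is lower caps your attainable performance. The GP100 peaks are the numbers quoted in the lecture; the arithmetic-intensity values are made up for illustration.

```python
PEAK_FLOPS = 20e12    # 20 TFLOPS of FP16 (GP100, as quoted above)
PEAK_BW    = 750e9    # 750 GB/s of HBM bandwidth

def attainable_flops(intensity_flops_per_byte):
    # classic roofline: min(compute roof, bandwidth * arithmetic intensity)
    return min(PEAK_FLOPS, PEAK_BW * intensity_flops_per_byte)

for ai in (1, 10, 26.7, 100):   # FLOPs performed per byte moved from memory
    print(f"AI={ai:6.1f} -> {attainable_flops(ai) / 1e12:5.1f} TFLOPS")
# below ~27 FLOPs/byte this GPU is memory-bound; above that, compute-bound
```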
945 00:56:22,248 --> 00:56:25,048 So remember, we have computation, we have memory, 946 00:56:25,048 --> 00:56:28,307 and the third thing is communication. 947 00:56:28,307 --> 00:56:31,547 All three factors have to be balanced in order to 948 00:56:31,547 --> 00:56:33,797 achieve good performance. 949 00:56:35,088 --> 00:56:39,171 So this is very powerful, but even more exciting, 950 00:56:40,558 --> 00:56:44,739 just about a month ago, Jensen revealed the newest 951 00:56:44,739 --> 00:56:48,077 architecture, called the Volta GPU. 952 00:56:48,077 --> 00:56:50,877 And let's see what is inside the Volta GPU. 953 00:56:50,877 --> 00:56:55,044 Just released less than a month ago, it has 15 954 00:56:57,568 --> 00:57:01,651 FP32 teraflops, and what is new here is the 120 955 00:57:03,950 --> 00:57:08,128 Tensor TFLOPS, specifically designed for deep learning. 956 00:57:08,128 --> 00:57:11,207 And we'll cover later what the Tensor Core is, 957 00:57:11,207 --> 00:57:13,957 and where this 120 is coming from. 958 00:57:16,368 --> 00:57:19,699 And rather than 750 gigabytes per second, this 959 00:57:19,699 --> 00:57:24,499 year, with HBM2, they are using 900 gigabytes per second of 960 00:57:24,499 --> 00:57:25,678 memory bandwidth. 961 00:57:25,678 --> 00:57:27,190 Very exciting. 962 00:57:27,190 --> 00:57:32,139 And the 12-nanometer process has a die size of more than 800 963 00:57:32,139 --> 00:57:33,248 square millimeters. 964 00:57:33,248 --> 00:57:37,310 A really large chip, supported by a 300-gigabyte-per- 965 00:57:37,310 --> 00:57:38,477 second NVLink. 966 00:57:40,931 --> 00:57:44,880 So what's new in Volta? The most interesting thing for us, 967 00:57:44,880 --> 00:57:49,251 for deep learning, is this thing called the Tensor Core. 968 00:57:49,251 --> 00:57:51,629 So what is a Tensor Core? 969 00:57:51,629 --> 00:57:56,200 A Tensor Core is actually an instruction that can 970 00:57:56,200 --> 00:58:00,987 do a four-by-four matrix times a four-by-four matrix: 971 00:58:00,987 --> 00:58:05,429 a fused FMA (FMA stands for Fused Multiply-Add), 972 00:58:05,429 --> 00:58:08,491 as a mixed-precision operation, 973 00:58:08,491 --> 00:58:11,074 in just one single clock cycle. 974 00:58:12,939 --> 00:58:15,698 So let's unpack for a little bit what this means. 975 00:58:15,698 --> 00:58:19,865 Mixed precision is exactly as we mentioned in the last 976 00:58:20,699 --> 00:58:24,866 section: we use FP16 for the multiplication, 977 00:58:26,430 --> 00:58:30,430 but for the accumulation, we do it with FP32. 978 00:58:31,928 --> 00:58:35,870 That's where the mixed precision comes from. 979 00:58:35,870 --> 00:58:38,657 So let's see how many operations: if it's four 980 00:58:38,657 --> 00:58:43,030 by four by four, that's 64 multiplications, all 981 00:58:43,030 --> 00:58:45,000 in one single cycle. 982 00:58:45,000 --> 00:58:48,920 That's a 12x increase in throughput for the Volta 983 00:58:48,920 --> 00:58:53,087 compared with the Pascal, which was released just last year (a sketch of this operation follows below). 984 00:58:55,099 --> 00:58:59,590 So this is the result for matrix multiplication at 985 00:58:59,590 --> 00:59:01,288 different sizes. 986 00:59:01,288 --> 00:59:05,455 The speedup of Volta over Pascal is roughly 3x 987 00:59:08,928 --> 00:59:11,845 for these matrix multiplications. 988 00:59:13,368 --> 00:59:16,790 What we care about more is not only matrix multiplication 989 00:59:16,790 --> 00:59:19,958 but actually running the deep neural nets.
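Here is the promised sketch of that operation: what a single D = A x B + C Tensor Core instruction computes on 4x4 tiles, mimicked functionally in NumPy by upcasting the FP16 inputs and accumulating in FP32. This mirrors the arithmetic only, not the circuit.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)).astype(np.float16)   # FP16 inputs
B = rng.normal(size=(4, 4)).astype(np.float16)
C = rng.normal(size=(4, 4)).astype(np.float32)   # FP32 accumulator

# products of FP16 inputs, accumulated in FP32 (emulated by upcasting)
D = A.astype(np.float32) @ B.astype(np.float32) + C

# 4 x 4 x 4 = 64 multiply-accumulates; the hardware does all of them
# in a single clock cycle
print(D.dtype, 4 * 4 * 4, "fused multiply-adds")
```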
990 00:59:19,958 --> 00:59:23,048 So, both for training and for inference. 991 00:59:23,048 --> 00:59:26,630 And for training on ResNet-50, by taking advantage 992 00:59:26,630 --> 00:59:29,998 of the Tensor Cores in the V100, 993 00:59:29,998 --> 00:59:33,581 it is 2.4x faster than the P100 using FP32. 994 00:59:38,887 --> 00:59:43,054 On the right-hand side, it compares the inference 995 00:59:43,899 --> 00:59:48,066 speedup, given a 7-millisecond latency requirement: 996 00:59:50,138 --> 00:59:53,910 what is the number of images per second it can process? 997 00:59:53,910 --> 00:59:56,459 That is a measurement of throughput. 998 00:59:56,459 --> 01:00:00,292 Again, the V100 over the P100, by taking advantage 999 01:00:03,796 --> 01:00:07,796 of the Tensor Cores, is 3.7x faster. 1000 01:00:13,887 --> 01:00:18,745 So this figure gives a rough idea of what a Tensor Core is, 1001 01:00:18,745 --> 01:00:22,287 what an integer unit is, and what a floating-point unit is. 1002 01:00:22,287 --> 01:00:23,954 So this whole figure 1003 01:00:27,705 --> 01:00:28,872 is a single SM, 1004 01:00:33,065 --> 01:00:35,004 a streaming multiprocessor. 1005 01:00:35,004 --> 01:00:39,495 The SM is partitioned into four processing blocks: 1006 01:00:39,495 --> 01:00:41,763 one, two, three, four, right? 1007 01:00:41,763 --> 01:00:45,846 And in each block there are eight FP64 cores 1008 01:00:48,105 --> 01:00:52,105 and 16 FP32 and 16 INT32 units. 1009 01:00:55,751 --> 01:01:00,353 And then there are two of the new mixed-precision 1010 01:01:00,353 --> 01:01:04,520 Tensor Cores, specifically designed for deep learning. 1011 01:01:07,641 --> 01:01:10,684 And there are also the warp scheduler, the dispatch unit, 1012 01:01:10,684 --> 01:01:13,513 and the register file, as before. 1013 01:01:13,513 --> 01:01:17,596 So what is new here is the Tensor Core unit (the back-of-the-envelope arithmetic below shows where the 120 Tensor TFLOPS figure comes from). 1014 01:01:18,935 --> 01:01:23,102 So here is a figure comparing the recent generations of 1015 01:01:25,722 --> 01:01:27,639 Nvidia GPUs, from Kepler 1016 01:01:29,164 --> 01:01:31,664 to Maxwell to Pascal to Volta. 1017 01:01:34,722 --> 01:01:37,425 We can see everything keeps improving. 1018 01:01:37,425 --> 01:01:40,733 For example, the boost clock has increased from 1019 01:01:40,733 --> 01:01:42,816 about 800 MHz to 1.4 GHz. 1020 01:01:46,563 --> 01:01:50,730 And starting from the Volta generation, there are 1021 01:01:52,855 --> 01:01:57,022 the Tensor Core units, which never existed before. 1022 01:01:59,241 --> 01:02:01,158 And up to Maxwell, 1023 01:02:02,364 --> 01:02:04,781 the GPUs were using GDDR5, 1024 01:02:07,924 --> 01:02:10,662 and from the Pascal GPU onward, 1025 01:02:10,662 --> 01:02:12,993 HBM came into place, 1026 01:02:12,993 --> 01:02:14,593 the high-bandwidth memory: 1027 01:02:14,593 --> 01:02:17,093 750 gigabytes per second here, 1028 01:02:18,543 --> 01:02:22,804 900 gigabytes per second here, compared with DDR3 at 1029 01:02:22,804 --> 01:02:24,804 30 gigabytes per second. 1030 01:02:27,364 --> 01:02:31,531 And the memory size actually didn't increase by much, 1031 01:02:34,204 --> 01:02:36,593 and the power consumption 1032 01:02:36,593 --> 01:02:38,783 also remained roughly the same. 1033 01:02:38,783 --> 01:02:41,844 But given the increase in computation you can fit 1034 01:02:41,844 --> 01:02:46,712 within a fixed power envelope, that's still an exciting thing.
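And the promised back-of-the-envelope arithmetic. The per-core figures follow the 4x4x4 fused multiply-add described above; the SM count of 80 and the roughly 1.46 GHz clock are outside assumptions about the V100, not numbers given in the lecture.

```python
sms             = 80            # streaming multiprocessors (assumed for V100)
tensor_cores    = sms * 4 * 2   # 4 blocks per SM, 2 Tensor Cores per block
fma_per_cycle   = 4 * 4 * 4     # 64 fused multiply-adds per core per cycle
flops_per_cycle = tensor_cores * fma_per_cycle * 2   # 1 FMA = 2 FLOPs
clock_hz        = 1.46e9        # assumed boost clock

print(flops_per_cycle * clock_hz / 1e12, "TFLOPS")   # ~120 Tensor TFLOPS
```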
1035 01:02:46,712 --> 01:02:49,433 And the manufacturing process is actually improving, from 1036 01:02:49,433 --> 01:02:53,600 28 nanometers to 16 nanometers, all the way to 12 nanometers. 1037 01:02:55,295 --> 01:02:58,033 And the chip area is also increasing, to 1038 01:02:58,033 --> 01:03:01,616 800 square millimeters; that's really huge. 1039 01:03:03,084 --> 01:03:07,513 So, you may be interested in the comparison of the GPU 1040 01:03:07,513 --> 01:03:09,663 with the TPU, right? 1041 01:03:09,663 --> 01:03:12,463 How do they compare with each other? 1042 01:03:12,463 --> 01:03:15,023 So, from the original TPU paper: 1043 01:03:15,023 --> 01:03:18,797 the TPU was actually designed roughly in 2015, 1044 01:03:18,797 --> 01:03:22,464 and this is a comparison with the Pascal P40 GPU, 1045 01:03:23,673 --> 01:03:25,090 released in 2016. 1046 01:03:27,815 --> 01:03:30,924 So for the TPU, the power consumption is lower, 1047 01:03:30,924 --> 01:03:34,273 and it has a larger on-chip memory of 24 megabytes, 1048 01:03:34,273 --> 01:03:38,015 a really large on-chip SRAM, managed by software. 1049 01:03:38,015 --> 01:03:42,593 And both of them support INT8 operations, 1050 01:03:42,593 --> 01:03:46,760 while for inferences per second given a 10-millisecond latency 1051 01:03:47,764 --> 01:03:50,484 budget, the TPU is at 1X 1052 01:03:50,484 --> 01:03:52,651 and the P40 is at about 2X. 1053 01:03:57,975 --> 01:03:59,558 So, just last week, 1054 01:04:01,682 --> 01:04:03,655 at Google I/O, 1055 01:04:03,655 --> 01:04:06,421 a new nuclear bomb landed on the Earth. 1056 01:04:06,421 --> 01:04:09,251 That is the Google Cloud TPU. 1057 01:04:09,251 --> 01:04:13,203 So now the TPU not only supports inference 1058 01:04:13,203 --> 01:04:15,353 but also supports training. 1059 01:04:15,353 --> 01:04:18,622 There is very limited information we can get 1060 01:04:18,622 --> 01:04:20,873 beyond this Google blog post. 1061 01:04:20,873 --> 01:04:24,790 Their Cloud TPU delivers up to 180 teraflops 1062 01:04:28,713 --> 01:04:32,130 to train and run machine learning models. 1063 01:04:33,422 --> 01:04:36,820 And this is multiple Cloud TPUs 1064 01:04:36,820 --> 01:04:38,903 making up a TPU pod, 1065 01:04:40,110 --> 01:04:44,963 which is built with 64 second-generation TPUs 1066 01:04:44,963 --> 01:04:48,542 and delivers up to 11.5 petaflops 1067 01:04:48,542 --> 01:04:50,873 of machine learning acceleration. 1068 01:04:50,873 --> 01:04:53,862 In the Google blog post, they mentioned that 1069 01:04:53,862 --> 01:04:56,420 one of their large-scale translation models 1070 01:04:56,420 --> 01:05:00,881 used to take a full day to train 1071 01:05:00,881 --> 01:05:05,048 on 32 of the best commercially available GPUs, probably P40 1072 01:05:06,731 --> 01:05:07,981 or P100, maybe. 1073 01:05:08,902 --> 01:05:11,380 And now it trains to the same accuracy 1074 01:05:11,380 --> 01:05:15,547 within just one afternoon, using just 1/8 of a TPU pod, 1075 01:05:17,523 --> 01:05:19,606 which is pretty exciting. 1076 01:05:22,611 --> 01:05:25,273 Okay, so, as a little wrap-up.
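Before the wrap-up, a quick sanity check on those pod numbers, as a back-of-the-envelope calculation (the per-device and pod figures are the ones quoted from the Google announcement; note they are only consistent in petaflops, not teraflops):

```python
per_device_tflops = 180   # one second-generation Cloud TPU, as quoted
devices_per_pod   = 64
pod_petaflops = per_device_tflops * devices_per_pod / 1000
print(pod_petaflops, "petaflops")   # 11.52, i.e. the ~11.5 PFLOPS quoted
```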
1077 01:05:25,273 --> 01:05:27,662 We covered a lot of stuff. We mentioned 1078 01:05:27,662 --> 01:05:30,763 the two-by-two design space of algorithm and hardware, 1079 01:05:30,763 --> 01:05:33,993 inference and training. We covered the algorithms for 1080 01:05:33,993 --> 01:05:36,982 inference: for example, pruning and quantization, 1081 01:05:36,982 --> 01:05:40,251 Winograd convolution, binary and ternary weights, 1082 01:05:40,251 --> 01:05:42,174 and weight sharing, for example. 1083 01:05:42,174 --> 01:05:44,603 And then the hardware for efficient inference: 1084 01:05:44,603 --> 01:05:46,353 for example, the TPU, 1085 01:05:48,665 --> 01:05:52,523 which takes advantage of INT8, 8-bit integers, 1086 01:05:52,523 --> 01:05:56,464 and also my design, the EIE accelerator, which takes advantage 1087 01:05:56,464 --> 01:05:59,951 of sparsity: anything multiplied by zero is zero, 1088 01:05:59,951 --> 01:06:03,201 so don't store it, don't compute on it. 1089 01:06:04,260 --> 01:06:07,131 And also the efficient algorithms for training: for example, 1090 01:06:07,131 --> 01:06:11,312 how we do parallelization, and the most recent research on 1091 01:06:11,312 --> 01:06:14,901 mixed precision training, taking advantage 1092 01:06:14,901 --> 01:06:18,151 of FP16 rather than FP32 to do training, 1093 01:06:19,131 --> 01:06:22,131 which is a four-times saving in energy 1094 01:06:22,131 --> 01:06:23,939 and a four-times saving in area, 1095 01:06:23,939 --> 01:06:27,731 and which doesn't really sacrifice the accuracy you get from 1096 01:06:27,731 --> 01:06:28,814 the training. 1097 01:06:31,803 --> 01:06:35,352 And also Dense-Sparse-Dense training, using better, 1098 01:06:35,352 --> 01:06:39,519 sparse regularization, and also the teacher-student model: 1099 01:06:41,021 --> 01:06:43,741 you have multiple teacher neural networks and a small 1100 01:06:43,741 --> 01:06:46,461 student network, and you can distill the knowledge 1101 01:06:46,461 --> 01:06:51,072 from the teacher neural networks via a temperature. 1102 01:06:51,072 --> 01:06:54,650 And finally, we covered the hardware for efficient training 1103 01:06:54,650 --> 01:06:57,580 and introduced two nuclear bombs. 1104 01:06:57,580 --> 01:07:01,747 One is the Volta GPU; the other is the TPU version two, 1105 01:07:02,590 --> 01:07:06,507 the Cloud TPU, and also the amazing Tensor Cores 1106 01:07:09,184 --> 01:07:12,771 in the newest generation of Nvidia GPUs. 1107 01:07:12,771 --> 01:07:16,632 And we also reviewed the progression of 1108 01:07:16,632 --> 01:07:20,861 the recent Nvidia GPUs, from the Kepler K40 1109 01:07:20,861 --> 01:07:23,461 (that's actually when I started my research, 1110 01:07:23,461 --> 01:07:25,283 what we used in the beginning) 1111 01:07:25,283 --> 01:07:28,033 through the M40, 1112 01:07:29,437 --> 01:07:33,213 then Pascal, and finally the exciting Volta GPU. 1113 01:07:33,213 --> 01:07:37,380 So every year there is a nuclear bomb in the spring. 1114 01:07:40,981 --> 01:07:44,992 Okay, a little look ahead into the future. 1115 01:07:44,992 --> 01:07:47,381 In the city of the future, we can imagine a lot 1116 01:07:47,381 --> 01:07:52,301 of AI applications: smart society, smart care, 1117 01:07:52,301 --> 01:07:56,504 IoT devices, smart retail (for example, Amazon Go), 1118 01:07:56,504 --> 01:07:59,984 and also the smart home; a lot of scenarios.
1119 01:07:59,984 --> 01:08:03,995 And this poses a lot of challenges for hardware design: 1120 01:08:03,995 --> 01:08:07,851 it requires low latency, privacy, mobility, 1121 01:08:07,851 --> 01:08:09,355 and energy efficiency. 1122 01:08:09,355 --> 01:08:12,202 You don't want your battery to drain very quickly. 1123 01:08:12,202 --> 01:08:15,155 So it's both a challenging and a very exciting era 1124 01:08:15,155 --> 01:08:18,904 for the co-design of both the machine learning 1125 01:08:18,904 --> 01:08:20,595 (deep neural network) model architectures 1126 01:08:20,595 --> 01:08:23,283 and the hardware architecture. 1127 01:08:23,283 --> 01:08:26,773 So we have moved from the PC era to the mobile era. 1128 01:08:26,773 --> 01:08:29,973 Now we are in the AI-First era, 1129 01:08:29,973 --> 01:08:32,818 and I hope you are as excited as I am about this kind of 1130 01:08:32,818 --> 01:08:36,485 brain-inspired cognitive computing research. 1131 01:08:37,773 --> 01:08:41,962 Thank you for your attention; I'm glad to take questions. 1132 01:08:41,962 --> 01:08:44,212 [applause] 1133 01:08:50,875 --> 01:08:52,625 We have five minutes. 1134 01:08:54,323 --> 01:08:55,643 Of course. 1135 01:08:55,643 --> 01:08:59,504 - [Student] Can you commercialize the deep architecture? 1136 01:08:59,504 --> 01:09:04,122 - The architecture, yeah, some of the ideas are pretty good. 1137 01:09:04,122 --> 01:09:06,583 I think there's an opportunity. 1138 01:09:06,584 --> 01:09:07,417 Yeah. 1139 01:09:11,841 --> 01:09:12,674 Yeah. 1140 01:09:30,091 --> 01:09:34,258 The question is, what can we do to make the hardware better? 1141 01:09:46,997 --> 01:09:48,979 Oh, right, the question is about 1142 01:09:48,979 --> 01:09:51,917 the challenges and opportunities for those small 1143 01:09:51,917 --> 01:09:54,699 embedded devices running deep neural networks 1144 01:09:54,699 --> 01:09:57,006 or AI algorithms in general. 1145 01:09:57,006 --> 01:10:00,673 Yeah, so those are the algorithms I discussed 1146 01:10:02,197 --> 01:10:04,947 in the beginning, about inference. 1147 01:10:06,309 --> 01:10:07,142 Here. 1148 01:10:08,579 --> 01:10:12,448 These are the techniques that can enable such 1149 01:10:12,448 --> 01:10:15,107 inference or AI running on embedded devices: 1150 01:10:15,107 --> 01:10:18,448 having fewer weights, fewer bits per weight, 1151 01:10:18,448 --> 01:10:20,648 and also quantization and low-rank approximation 1152 01:10:20,648 --> 01:10:24,397 (a smaller matrix, the same accuracy), even going to binary 1153 01:10:24,397 --> 01:10:27,808 or ternary weights, having just two bits 1154 01:10:27,808 --> 01:10:31,288 to do the computation rather than 16 or even 32 bits, 1155 01:10:31,288 --> 01:10:33,745 and also the Winograd transformation. 1156 01:10:33,745 --> 01:10:36,456 Those are the enabling algorithms for those 1157 01:10:36,456 --> 01:10:38,706 low-power embedded devices. 1158 01:10:57,356 --> 01:11:02,189 Okay, the question is: if the weights are binary, software 1159 01:11:02,189 --> 01:11:06,356 developers may not be able to take advantage of it. 1160 01:11:07,509 --> 01:11:11,418 There is a way to take advantage of binary weights. 1161 01:11:11,418 --> 01:11:14,418 So in one register there are 32 bits. 1162 01:11:16,538 --> 01:11:19,827 Now you can think of it as 32-way parallelism. 1163 01:11:19,827 --> 01:11:22,457 Each bit is a single operation. 1164 01:11:22,457 --> 01:11:25,120 So say previously we had 10 ops per second. 1165 01:11:25,120 --> 01:11:27,703 Now you get 320 ops per second.
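A minimal sketch of that 32-way trick, under the usual binary-network convention that weights and activations are +1/-1 values packed one bit per lane (hypothetical Python; the packing scheme is made up for illustration). A 32-element dot product collapses to one XNOR plus a popcount.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.choice([-1, 1], size=32)   # binary weights
a = rng.choice([-1, 1], size=32)   # binary activations

# pack: map -1 -> 0 and +1 -> 1 into a single 32-bit integer
pack = lambda v: int("".join('1' if b == 1 else '0' for b in v), 2)
W, A = pack(w), pack(a)

xnor = ~(W ^ A) & 0xFFFFFFFF       # 32 "multiplies" in one bitwise operation
matches = bin(xnor).count("1")     # popcount: how many products equal +1
dot = 2 * matches - 32             # (#+1 products) - (#-1 products)

assert dot == int(np.dot(w, a))    # same answer as the element-wise version
```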
1166 01:11:31,000 --> 01:11:33,917 You can do these bitwise operations, 1167 01:11:34,960 --> 01:11:37,287 for example, XOR operations. 1168 01:11:37,287 --> 01:11:39,368 So with one register, 1169 01:11:39,368 --> 01:11:42,285 one operation becomes 32 operations. 1170 01:11:43,608 --> 01:11:47,058 So there is a paper called XNOR-Net; 1171 01:11:47,058 --> 01:11:49,845 they, very amazingly, implemented it 1172 01:11:49,845 --> 01:11:52,637 on the Raspberry Pi using this feature 1173 01:11:52,637 --> 01:11:55,907 to do real-time detection. Very cool stuff. 1174 01:11:55,907 --> 01:11:56,740 Yeah. 1175 01:12:11,779 --> 01:12:15,946 Yeah, so the trade-off is always power, area, 1176 01:12:16,956 --> 01:12:19,819 and performance. In general, all hardware designs 1177 01:12:19,819 --> 01:12:23,298 have to take into account the performance, the power, 1178 01:12:23,298 --> 01:12:24,798 and also the area. 1179 01:12:26,158 --> 01:12:29,387 When machine learning comes in, there's a fourth 1180 01:12:29,387 --> 01:12:32,107 figure of merit, which is accuracy. 1181 01:12:32,107 --> 01:12:34,089 What is the accuracy? 1182 01:12:34,089 --> 01:12:37,019 And there is a fifth one, which is programmability: 1183 01:12:37,019 --> 01:12:39,089 how general is your hardware? 1184 01:12:39,089 --> 01:12:42,089 For example, if Google just wants to use it for AI 1185 01:12:42,089 --> 01:12:45,507 and deep learning, it's totally fine 1186 01:12:45,507 --> 01:12:48,635 to have a very specialized architecture 1187 01:12:48,635 --> 01:12:51,206 just for deep learning, supporting convolution, 1188 01:12:51,206 --> 01:12:54,307 multi-layer perceptrons, and long short-term memory; 1189 01:12:54,307 --> 01:12:58,224 but for GPUs, you also want to have support for 1190 01:13:00,067 --> 01:13:03,734 scientific computing or graphics, AR and VR. 1191 01:13:04,915 --> 01:13:07,998 So that's one difference, first of all. 1192 01:13:10,804 --> 01:13:14,244 And the TPU is basically an ASIC, right? 1193 01:13:14,244 --> 01:13:16,987 It's very fixed-function, but you can still program it 1194 01:13:16,987 --> 01:13:21,587 with coarse instructions; the people at Google 1195 01:13:21,587 --> 01:13:24,755 designed those coarse-granularity instructions. 1196 01:13:24,755 --> 01:13:27,467 For example, one instruction just loads a matrix, 1197 01:13:27,467 --> 01:13:29,795 stores a matrix, does convolutions, 1198 01:13:29,795 --> 01:13:31,507 or does matrix multiplications. 1199 01:13:31,507 --> 01:13:34,377 Those are coarse-grained instructions, 1200 01:13:34,377 --> 01:13:37,710 and they have a software-managed memory, 1201 01:13:38,605 --> 01:13:40,558 also called a scratchpad. 1202 01:13:40,558 --> 01:13:43,885 It's different from a cache, where the hardware decides 1203 01:13:43,885 --> 01:13:47,217 what to evict from the cache; but now, 1204 01:13:47,217 --> 01:13:49,845 since you know the computation pattern, 1205 01:13:49,845 --> 01:13:53,512 there's no need to do out-of-order execution 1206 01:13:54,446 --> 01:13:57,066 or branch prediction, no such things. 1207 01:13:57,066 --> 01:14:00,255 Everything is deterministic, so you can take advantage of 1208 01:14:00,255 --> 01:14:04,422 it and maintain a fully software-managed scratchpad 1209 01:14:05,337 --> 01:14:09,897 to reduce the data movement. And remember, data movement 1210 01:14:09,897 --> 01:14:13,084 is the key to reducing the memory footprint 1211 01:14:13,084 --> 01:14:14,606 and energy consumption (see the tiling sketch below). 1212 01:14:14,606 --> 01:14:15,439 So, yeah.
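A toy sketch of the software-managed-scratchpad idea in Python: because the computation pattern is known ahead of time, the "compiler" stages one tile at a time into a small fast buffer and reuses it heavily, instead of letting a hardware cache guess. The scratchpad size and tiling here are made up for illustration.

```python
import numpy as np

SCRATCHPAD_WORDS = 16 * 16   # pretend this is all the fast on-chip memory we have

def tiled_matmul(A, B, tile=16):
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                # explicit, pre-scheduled data movement into the "scratchpad"
                a = A[i:i+tile, k:k+tile].copy()
                b = B[k:k+tile, j:j+tile].copy()
                assert a.size <= SCRATCHPAD_WORDS
                C[i:i+tile, j:j+tile] += a @ b   # heavy reuse of the staged tiles
    return C

A = np.random.rand(64, 64); B = np.random.rand(64, 64)
assert np.allclose(tiled_matmul(A, B), A @ B)
```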
1213 01:14:26,633 --> 01:14:30,313 The Movidius and Nervana architectures, actually, I'm not quite 1214 01:14:30,313 --> 01:14:33,813 familiar with, and I didn't prepare those slides, so 1215 01:14:34,736 --> 01:14:37,569 I'll comment on that a little bit later. 1216 01:14:52,428 --> 01:14:54,507 Oh, yeah, of course. 1217 01:14:54,507 --> 01:14:57,778 Those can always, and certainly, be applied 1218 01:14:57,778 --> 01:15:00,269 to low-power embedded devices. 1219 01:15:00,269 --> 01:15:03,686 If you're interested, I can show you a... 1220 01:15:04,629 --> 01:15:05,462 Whoops. 1221 01:15:06,971 --> 01:15:08,888 Some examples of, oops. 1222 01:15:10,689 --> 01:15:11,859 Where is that? 1223 01:15:11,859 --> 01:15:15,731 Of my previous projects running deep neural nets. 1224 01:15:15,731 --> 01:15:19,394 For example, on a drone: this is using an Nvidia TK1 1225 01:15:19,394 --> 01:15:23,561 mobile GPU to do real-time tracking and detection. 1226 01:15:26,691 --> 01:15:28,898 This is me playing my nunchaku, 1227 01:15:28,898 --> 01:15:32,898 filmed by a drone doing the detection and tracking. 1228 01:15:34,672 --> 01:15:38,939 And also, this FPGA running a deep neural network. 1229 01:15:38,939 --> 01:15:41,039 It's pretty small, 1230 01:15:41,039 --> 01:15:44,611 about this large, doing face alignment and 1231 01:15:44,611 --> 01:15:48,194 detecting the eyes, the nose, and the mouth 1232 01:15:49,352 --> 01:15:51,602 at a pretty high frame rate, 1233 01:15:53,151 --> 01:15:55,401 consuming only three watts. 1234 01:15:56,918 --> 01:16:00,689 This is a project I did at Facebook, running 1235 01:16:00,689 --> 01:16:03,269 deep neural nets on the mobile phone to do 1236 01:16:03,269 --> 01:16:06,781 image classification; for example, it says it's a laptop, 1237 01:16:06,781 --> 01:16:10,389 or you can feed it an image and it says 1238 01:16:10,389 --> 01:16:14,480 it's a selfie, it has a person and a face, et cetera. 1239 01:16:14,480 --> 01:16:17,621 So there's a lot of opportunity for 1240 01:16:17,621 --> 01:16:21,788 embedded or mobile deployment of deep neural nets. 1241 01:16:30,419 --> 01:16:32,288 No, there is a team doing that, 1242 01:16:32,288 --> 01:16:34,808 but I cannot comment too much, probably. 1243 01:16:34,808 --> 01:16:38,975 There is a team at Google doing that sort of stuff, yeah. 1244 01:16:44,876 --> 01:16:46,208 Okay, thanks, everyone. 1245 01:16:46,208 --> 00:00:00,000 If you have any questions, feel free to drop me an e-mail.